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Abstract 

The method of stable random projections [39, 41] is popular for data streaming computations, data mining, 
and machine learning. For example, in data streaming, stable random projections offer a unified, efficient, 
and elegant methodology for approximating the l a norm of a single data stream, or the l a distance between 
a pair of streams, for any < a < 2. [18] and [20] applied stable random projections for approximating 
the Hamming norm and the max-dominance norm, respectively, using very small a. Another application 
is to approximate all pairwise l a distances in a data matrix to speed up clustering, classification, or kernel 
computations. Given that stable random projections have been successful in various applications, this 
paper will focus on three different aspects in improving the current practice of stable random projections. 

Firstly, we propose very sparse stable random projections to significantly reduce the processing and 
storage cost, by replacing the a-stable distribution with a mixture of a symmetric a-Pareto distribution 
(with probability (3, < (3 < 1) and a point mass at the origin (with a probability 1 — j3). This leads to a 
significant -g-fold speedup for small (3. We analyze the rate of convergence as a function of (3, a, and the 
data regularity conditions. For example, when a = 1 and the data have bounded second moments, then if 
we choose (3 = D \ /2 , the rate of convergence would be O (l) -1 ' 2 ), which is fast even for moderate D. 
Here D is the data dimension. Some numerical evaluations are conducted, on synthetic data, Web crawl 
data, and gene expression microarray data. 

Secondly, we provide an improved estimator for recovering the original l a norm from the projected data. 
The standard estimator is based on the (absolute) sample median [39, 19], while we suggest using the 
geometric mean. The sample median estimator is difficult to analyze precisely or non-asymptotically. The 
geometric mean estimator we propose is strictly unbiased and is easier to study. Moreover, the geometric 
mean estimator is more accurate, especially non-asymptotically. When a — ► 0+ (as considered in 
[18, 19, 20]), even asymptotically, the geometric mean estimator is still about 27% more accurate in terms 
of variances. In addition, we show that, when a — 0+, the maximum likelihood estimator has a simple 
form and is considerably more accurate than both the geometric mean and sample median estimators. 



Thirdly, we provide an explicit answer to the basic question of how many projections (samples) are needed 
for achieving some pre-specified level of accuracy. [39, 19] did not provide a criterion that can be used 
in practice. The geometric mean estimator we propose allows us to derive sharp tail bounds which can 
be expressed in exponential forms with constants explicitly given. From these tail bounds, an analog 
of the Johnson-Lindenstrauss (JL) Lemma for dimension reduction in l a follows: It suffices to use k = 

O ^ l ° s i^ ^ projections to guarantee that the l a distance between any pair of data points (among n data 

points) can be estimated within a 1 ± e factor with high probability, using the proposed geometric mean 
estimator. 



1 Introduction 



Stable random projections[39, 41] can be applied at least in two types of applications: approximating all pairwise 
distances and data stream computations. 



1.1 Computing All Pairwise Distances 

We start with a "data matrix" A e R nxD , with n rows and D columns (n data points in D dimensions). For example, 
A can be the term-by -document matrix at Web scale. Many applications require computing all pairwise distances of 
A, the exact computation of which would cost 0(n 2 D), infeasible at Web scale. Broder and his colleagues [14, 12] 
developed various versions of the min-wise sketching algorithm[13, 16] to approximate all pairwise resemblance 
distances, for syntactic clustering of the AltaVista Web crawls and for removing duplicate pages. 

We do not have to use the resemblance distance. Instead, we could use some l a distance, < a < 2. Given two data 
points u\ and u 2 in D dimensions, the individual l a norms and the l a distance are 

/ d \ V« / D \ V« / d \ V<* 

(X>mI q J > (X> 2 .*i"J > d w = (EKi-^rJ . (i) 

The idea of stable random projections[39, 41] is to multiply the original data matrix A <G M™ x£> with a non-adaptive 
random projection matrix R e R Dxk sampled i.i.d. from an a-stable distribution [60] , resulting in a projected matrix 
B = AR e R" xfc . If k is fixed and small (we will show precisely how small k can be.), then the cost for computing 
all-pairwise distances will be reduced from 0(n 2 D) to just 0(n 2 k + nDk), provided we can estimate the original l a 
distances in A from B. 

Given two data points u\, u 2 (two rows in A), we denote the corresponding projected vectors by v\, v 2 <E K fc , i.e., 
vi = R T «i, v 2 — R T w 2 . We recommend the following estimator to reconstruct d( Q ), the distance between ui and u 2 : 



IlLi \vi,j - v 2 j 



l/k 



(a),gm some correction factor ' ^ 
which is the geometric mean with corrections. Alternatively, [39, 19] proposed using the (absolute) sample median: 

1 _ median-Qij - v 2 ,j\,j = 1,2, ...,fc} 
v ; some correction factor 

which is a special case of sample quantile estimators [30, 31, 52]. The sample median estimator d( a ). me is not as 

accurate, especially when k is not too large. Moreover, the theoretical analysis on rf( a ), me is not as convenient, 
especially non-asymptotically. 

We will show that using rf( a ) l9m , it suffices to choose k = O ^ lo ^") ^ so that the l a distance between any pair of data 
points in A can be estimated within a 1 ± e factor with high probability. 



1.2 Data Stream Computations 



In data stream computations[38, 32, 39, 7, 18], stable random projections can be used at least for (A): approximating 
the l a frequency moments for individual streams; (B): approximating the l a differences between a pair of streams; 
(C): approximating the number of non-zero items (the Hamming norm) in a stream using very small a[18, 19]. Here 
we only consider < a < 2; but we should mention that a > 2 is also sometimes computed[4]. 

Massive data streams are fundamental in many modern data processing applications. Data streams come from Internet 
routers, phone switches, atmospheric observations, sensor networks, highway traffic conditions, finance data, and 
more[38, 32, 39, 7, 18]. Unlike in the traditional databases, it is not common to store massive data streams; and hence 
the processing is often done "on the fly." For example, in some situations, we only need to "visually monitor" the data 
by observing the time history of certain summary statistics, e.g., sum, number of distinct items, or any l a norm. 

If we are only interested in the sums (or the range-sums), the DLT priority sampling algorithm would be ideal, es- 
pecially for positive data[26, 27, 3]. Although the priority sampling algorithm had been proposed and implemented 
commercially for a few years, only recently [58] proved the non-asymptotic variance (bound) and the best constant. 



Note that the sum of positive items is the l\ norm. [39] described the procedure to use Cauchy (which is 1-stable) 
random projections for approximating the l\ norms (or l\ differences) of general data streams. For a data stream, u\, 
which contains pairs (i, u\ t i), i € {1,2, D}, [39] suggested the following steps (at least for the ideal version): 

• Choose k = O (^)- Initialize Vij = 0, j = 1, 2, k. 

• Generate a matrix R e R Dxk , with entries nj i.i.d. samples of standard Cauchy. 

• For each new pair (i, u\ t i), modify vij = v\j + TijUi^, for each j — 1, 2, k. 

• Return medianjlui^il, |ui,2|, |^i,fc|} as the approximate li norm of u\. 

[18, 19] extended the above procedure to general < a < 2. [39, 18, 19] did not provide a practical criterion for 
choosing the sample size k. 

1.3 Comparing Data Streams Using Hamming Norms 

[18, 19] proposed approximating the Hamming norms of data streams using stable random projections with small a, 
because the l a norm raised to the ath power approximates the Hamming norm well if a is sufficiently small. The 
Hamming norm gives the number of non-zero items present in a single stream; and it is also an important measure 
of (dis)similarity when applied to a pair of streams [18, 19]. Note that for static data, one could approximate the 
Hamming norms directly by applying 2-stable (i.e., normal) random projections on the binary-quantized (0/1) data. 
[18, 19] considered the dynamic setting in that the data may be subject to frequent additions/subtractions. 

In order to well approximate the Hamming norm, [18, 19] let < a < e/\og(U), where U is the largest item (in 
absolute values) in the stream(s). [18, 19] considered that, if an estimator, say d, approximates the truth, d, within a 
factor of 1 ± e, then d a will be within (1 ± e) a factor of d a . Our concern is, because a is very small, (1 ± e) a m 1 ± ea. 
If a = e, then we will end up with a 1 ± e 2 factor instead of the usual lie factor we like to have. 

In this study, we will provide strictly unbiased, geometric mean types of estimators for both the l a norm and the l a 
norm raised to the ath power, as well as their tail bounds. For the case a — > 0+, we will compare in detail the 
geometric mean estimator with the sample median estimator, as well as the maximum likelihood estimator. Very 
interestingly, the maximum likelihood estimator in this case has a simple convenient form, whose variance is only 
about one half of the variance of regular normal (l 2 ) random projections. In other words, stable random projections 
with very small a would be indeed ideal for approximating the Hamming norms, not only in the dynamic settings but 
also preferable in static data. 

1.4 Which Norm (a) to Use? 

a = 2 is the most thoroughly studied case [59]. When a = 2, we could directly estimate the original l 2 distances from 

the projected l 2 distances. The Johnson-Lindenstrauss (JL) Lemma says we only need k = ^ lo ^") (constants are 

well-known) so that the (squared) l 2 distance between any pair of data points can be estimated within a 1 ± e factor 
with high probability. Many different versions of the JL Lemma have been proved[43, 35, 42, 5, 22, 39, 40, 1, 6, 2]. 

The case a = 1 is also very often encountered in practice. However, it has been proved by [10, 45, 1 1] that one can 
not hope to develop an estimator that is a metric for dimension reduction in li without incurring large errors. 

Other norms are also possible. In data streaming computations, as a increases, the l a distance attributes more signif- 
icance to a large individual component; and therefore varying a provides a tunable mechanism[34]. This argument 
applies directly also in the machine learning content. As a concrete example, [15] proposed a family of non-Gaussian 
radial basis kernels for SVM in the form of K(x, y) = cxp (— pJ2i \ x i ~~ Vi\ b )' f° r data points x and y. (Here b is 
our a.) [15] showed that b = 0.5 in some cases gave better results in histogram-based image classifications. 

The l a norm with a < 1 is now well-understood to be a natural measure of sparsity[24, 25]. Of course, this is why 
[18, 19] approximate the Hamming norm with the l a norm using small a. [20] adopted the similar idea to approximate 
the max-dominance norm in data streams using very small a. 

1.5 Paper Organization 

The paper is organized as follows. Section 2 briefly reviews stable random projections and summarizes our main 
results. We will then introduce very sparse stable random projections in Section 3. We will study in Section 4 the 
estimators and tail bounds for recovering both the l a norm and the l a norm raised to the crth power. Section 5 compares 



the proposed geometric mean estimator with the sample median estimator, particularly for the case a — > 0+. Section 
5 also proposes a bias-corrected maximum likelihood estimator for the case a — > 0+. Section 6 concludes the paper. 

2 Review of Stable Random Projections and Summary of Main Results 

A random variable x is symmetric a-stable if its characteristic function can be written as E (exp {yf— Ixi)) = 
exp (—d a \t\ a ), where d > is the scale parameter. We write x ~ S(a,d), which in general does not have a 
closed-form density function except for a = 2 (normal) or a = 1 (Cauchy). 

The basic fact: if z\, z 2 , zd, i-i-d. are S(a, 1), then for any constants (i.e., the original data) g\, g 2 , gu, we have 

d ( ( d \ 1 ^ a \ D 

Si=i 9i z i ~ ^ ( a ; ( Si=i lff*| Q ) ) • That is, the projected data J2i=i 9* z i a ^ so f°ll° w an a-stable distribution 

with the scale parameter being the l a norm of the vector [g\, g 2 , goY ■ 

Therefore, given two vectors m, u 2 € M. D (e.g., u\ and u 2 are the leading two rows in the data matrix A), if 
v\ = R T ?ii and v 2 = R T U2, where the entries of R e K £>xfc , nj, are i.i.d. samples of S(a, 1), then xj — vij — v 2 ,j, 
j = 1,2, k are i.i.d. S(a, d( a )), where d^ is the l a distance between u\ and u 2 . 

Thus, the problem eventually boils down to estimating the scale parameter of S(a,di a \), from k i.i.d. samples. 
Estimators based on the maximum likelihood, which are asymptotically (as k — > oo) optimal, are computationally 
very intensive except for a = 2, 1, 0+; and hence they are not practical for many applications. We recommend the 
estimators based on the geometric mean which are computationally convenient and still quite accurate, especially 
when a is around 1. Moreover, the geometric mean estimators are convenient for theoretical analysis, e.g., variances 
and tail bounds. 

Our main contributions include (A): Very sparse stable random projections; (B): Estimators and tail bounds for stable 
random projections. 

2.1 Very Sparse Stable Random Projections 

We suggest replacing the entries, S(a, 1), in the projection matrix R with the following i.i.d. entries 

!P a with prob. ^ 
with prob. 1 - /3 , (4) 
— P a with prob. ^ 

where P a denotes an a-Pareto variable, Pareto(a, 1), Pr (P a > t) = if t > 1; and otherwise. The projected 
data Y^f=\ 9i r ij w ^ be asymptotically stable under certain regularity condition. This procedure is beneficial because 

• It is much easier to sample from an a-Pareto distribution than from S(a, 1). 

• Computing A x R costs only 0{f3nDk) as opposed to 0(nDk), a i-fold speedup, where the data matrix 

A e w ixD . 

• The storage (of R) cost is reduced from O(Dk) to 0(/3Dk). 

We will give the conditions for convergence and the rates of convergence as functions of a, (3, and the data regularity 
conditions. Two "easy-to-remember" statements are: 

• In order for very sparse stable random projections to converge, the data should have at least bounded ath mo- 
ments. 

• When the data have bounded second moments and a < 1, we can let (i = and the rate of convergence will 
be at least O (Z) -1 / 2 ), which is fast even for moderate D. 

We notice that [39, 19] had suggested using (4) with (3 = 1 as the standard practice, without showing the convergence 
conditions and the rates of convergence. 

Non-asymptotic analysis on very sparse stable random projections is difficult even for (3 — 1. Therefore, whenever 
we discuss about estimators and tail bounds, we assume that we are using regular stable random projections. 



2.2 Estimators and Tail Bounds 

Here we only present the estimator for d/ a \, the l a distance between u\ and u 2 € M. D , from the projected data 
difference, Xj = (vij — v 2 j) ~ S(a, rf( a )), j = 1, 2, k. Our proposed estimator is based on the geometric mean: 



n fe - \ x j\ 1/k i 

c ^«'i om = — ri Xj ~ SYa, rfr„i)i i.i.d., k> — (5) 

i), S m is unbiased, i.e., E (d( a ), flm ) = d( a ). 

The correction term can be pre-computed for small fc. For large fc, we have the asymptotic formula 



7T 



exp I - 7e I 1 - - I I , (6) 



Q 



where 7 e = 0.577215665..., is the Euler's constant. It converges from above monotonically. 
The variance is (valid for k > — ) 

l! 

• The tail bounds are 

e 2 \ I 



Vara, mMlO^l^HK A-f (^( l + ±-\ + o(^\\ (7) 

Var [d (aUm j - d (a) fc - 1 - d (a) I 6fc I 2 a 2 J U 2 ' ' 



, , e > 0, fc > - (8) 
~jTf~~ \ 0<e< l,fc>fc > -, (9) 



Pr (d(a),sm - d(a) < -«*(<*)) < CXp ^-fc 

where M R ^ a ^ and M £ ct ^ fco are explicitly given. 

Using d( a ). gm , it suffices to let fc = O ^ lo g(") ^ so that the Z Q distance between any pair of data points (or data 

streams) among n data points (or data streams) can be estimated within a 1 ± e factor, with high probability. The 
constant can be determined from M R ^ a ^ and M £ ct ^ fco . 



3 Very Sparse Stable Random Projections 

We suggest a procedure to simplify stable random projections and significantly reduce the processing and storage cost. 

Recall the basic fact about stable distributions: If z\, z 2 , Z£>, i.i.d. are S(a, 1), then for any constants (i.e., the 

original data) g lt g 2 , go, we have Yh=i 9i z i ~ S (a, (^£=i Iftl") 1 ^) ■ That is > the projected data Yh=i 9i z i 

also follow an a-stable distribution with the scale parameter being the l a norm of the vector [gi,g 2 , 5d] t - 

However, it is expensive to sample from S(a, 1), if a ^ 1 or 2. For example, [55, Proposition 1.71.1] describes 
a popular procedure for sampling from S(a, 1). That is, we first sample W\ uniform on (— f , § ) and E\ from an 
exponential distribution with mean 1 . If W\ and E\ are independent, then 

sin(aWi) /cos ((1 - a)W x ) 
cos(VFi) 1 /" ^ E[ 

is distributed as S(a, 1). Apparently, this procedure is quite costly. 

The procedure for conducting stable random projections is also quite expensive. For example, the cost of matrix 
multiplication A x R would be 0(nDk), where A e R nxD is the data matrix and R e M Dxfe is the projection 
matrix consisting of i.i.d. samples of S(a, 1). 

There is also a considerable storage cost for R. There are at least two reasons why we need to store R. Firstly, in 
some scenarios, we need to consider that new data points (or data streams) will be added to the dataset. Secondly, in 
data stream computations, the data entries do not necessarily arrive in orders[39]. In fact, the data may be also subject 
to frequent additions/subtractions [18, 19]. The cost of storing R is O(Dk). 




To tackle the above issues, we suggest replacing S(a, 1) with the following 



P a with prob. | 

z t = { with prob. 1 - j3 , (11) 
-P a with prob. £ 

where P Q denotes an a-Pareto distribution, Pareto(a, 1). That is, Pr (P a > t) = ^ if t > 1; and otherwise. 

We call this approach very sparse stable random projections because on average only f3 fraction of the entries are 
non-zeros, i.e., a -^-fold speedup in computing A x R, from 0{nDk) down to O(finDk). The storage cost is reduced 
from O(Dk) to 0{(3Dk). 

There are two fundamental reasons why this approach should work: 

• The data should satisfy certain regularity conditions otherwise the l a norms may not be meaningful. For exam- 
ple, when using the li norm, implicitly we expect that the data have at least bounded first moments. 

• The data dimension D should be very large, otherwise there would be no need for approximate answers. 

We are inspired by the recent work on very sparse random projections for dimension reduction in I2 [48], which 
showed the regularity condition for convergence and rate of convergence using known statistical theorems: the Linde- 
berg central limit theorem and the Berry-Esseen theorem. In our case, we also need to analyze under what conditions 
very sparse stable random projections will converge, as well as the rates of convergence. The necessary and sufficient 
condition for convergence is known (e.g., [36]), for both i.i.d. and non-i.i.d. (independent but not identical) cases. The 
rates of convergence for the i.i.d. case are also known, see [33] and a recent paper[44]. In our case, since we have 
to deal with the non-i.i.d. scenario, we will resort to the first principle, i.e., by studying the characteristic function of 
E£=i 9i z i> m Lemma 1 (proved in Appendix A). 

Lemma 1 Suppose z it i = 1,2, D, are i.i.d. random variables defined in (11). Then as D — > 00, 



(/3Ef =1 lff S | Q )' 

provided 



S ^a, — a) cos (^ a )) ^ i distribution, (12) 



max (\gi\) 

Ki<D 



(Etils.h)' 



0. (13) 



The rate of convergence is 



(14) 



There is no need to consider a = 2 because we can sample from normals or the sparse distribution suggested in [1, 48]. 
Note that (13) is only a convenient sufficient condition. 

Lemma 1 is not very interpre table. For convenience, we will assume that the data gi's are i.i.d. Suppose the data have 
bounded second moments, we have the following corollary (can be shown by strong law of large numbers). 

Corollary 1 Suppose \gi\'s are i.i.d. with bounded second moments, then the convergence condition (13) is satisfied. 
If we choose (3 = ^j=, then the rate of convergence is O (D~ mm (M/a:-i/2)^ 

In other words, if the data have bounded second moments (not a very strict condition), we can achieve a significant 
Vl?-fold speedup and the rate of convergence is still reasonable if a is not close to 2. For example, the rate is 



O (Z)^ 1 / 2 ) when a = 1. In practical applications, because D is very large, a rate O (Z) -1 / 2 ) should be fast enough. 
On the other hand, when a is approaching 2, then the rate of convergence will be very slow (if converges at all) even 
if we let (3=1. Therefore, we do not recommend replacing the stable distribution with Pareto when a is close to 2. 

Next, we will consider the case when the data do not have bounded second moments or even first moments. To simplify 
the arguments, we assume the data \gi\'s are i.i.d. and follow an Ty-Pareto distribution with rj < 2. Recall if a random 
variable x follows an r/-Pareto distribution, then E(x 7 ) < oo if 7 < 77 and E(a; 7 ) = 00 if 7 > 77. 

Heavy-tailed data (without bounded second moments) are usually modeled by Pareto-type distributions. [53] measured 
the r] values for many kinds of datasets. While it is quite often that 1 < 77 < 2, it is not very common that r\ < 1. 
For example, [53] measured the word frequency has r\ = 1.2, which is the well-known highly heavy-tailed case. The 
frequency of family names has 77 = 0.94 and the intensity of wars has 77 = 0.8, both not too far from 1. 

Corollary 2 Suppose \gi\'s are i.i.d. -q-Pareto with 77 < 2, then the convergence condition (13) is satisfied if t] > a. 
Assuming 77 > a, the rate of convergence would be 



' O ( p2/a .^ 2/a . 2/n ) , if 1 < a < 2 

O (max I D2 _ m J (1 , 2a/v) , }) , if < a < 1 



(15) 



If we choose (3-D 2 -° , then the rate of convergence would be 

O ( pvi-x/J , if 1 < a < 2 

(16) 

O (max { £|2 _ max 1 (1 , 2et/t , ) , D1 /Li/„ }) > if 0<a<l 

Proof: This corollary can be shown by the fact that if Xi is rj-Pareto, i.i.d., then J2iLi x l g rows as O (fl mai ( 1 ^/i)). 
See [28, Example 2.7.4]. 



Therefore, in order for very sparse stable random projections to converge (for any < (3 < 1), we have to make sure 
that the original data should have at least bounded ath moment. This is a very natural requirement. When the data 
have bounded higher moments, we can obtain a faster rate of convergence and afford a smaller (3. 

We can see that if D is larger enough (say 10 5 ), it is quite easy to achieve a 10-fold or 100-fold speedup. A factor of 
100 (or even 10) may be significant enough to make a theoretically appealing algorithm become a practical one. 

Our numerical studies show that very sparse stable random projections work really well (probably more than what we 
would expect). In the next three subsections, we will present some numerical results on the synthetic data, some Web 
crawl data, and the Harvard microarray data, respectively, for the l\ case (a = 1). 

3.1 Numerical Results on Synthetic Data 

We simulate data from Pareto(ri, 1) for 77 = 1.5 and rj = 2.0. We choose a = 1.0 (i.e., the l\ norm). Corollary 2 

recommends (3 — D~ 2 -° , which is respectively D" 1 / 3 and D^ 1 / 2 for 77 = 1.5 and 77 = 2.0. To make the results 
more interesting, we choose (3 = 4 and D~ ' 75 , respectively, otherwise all curves will simply overlap. 

We generate data for D ranging from 100 to 10 6 and apply very sparse stable random projections (J3 = D~ 0A and 
D -0 - 75 ) for k ranging from 10 to 100. We then estimate l\ norms using the geometric mean estimator we will discuss 
in Section 4, as if the projected data were exactly stable. The mean square errors (MSE's) are presented in Figure 1, 
which only plots D = 100, 500, and 1000 because the curves corresponding to larger ZD's overlap. The results indicate 
that very sparse stable random projections work really well even when D is not too large. 

3.2 Numerical Results on Web Crawl Data 

We apply very sparse stable random projections on some MSN Web crawl data. We pick two pairs of words, THIS- 
HAVE, and SCHOOL-PROGRAM. The data dimension D = 2 16 = 65536. For each word, the ith entry (i = 1 to D) 
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Figure 1: We simulate data from Pareto(ri, 1) for?/ = 1.5 (a) r/ = 2.0 (b), respectively. The mean square errors (MSE) 

are plotted against the sample size k for each 13=100, 500, and 1000. The "theoretic" curves were the theoretical 

variances assuming the data are exactly (instead of asymptotically) stable (see Lemma 2). 



is the number of occurrences this word appeared in the ith Web page. It is well-known that the word frequency data 
are highly heavy-tailed. The pair THIS-HAVE are frequent words while the pair SCHOOL-PROGRAM are relatively 
infrequent. Some summary statistics are given in Table 1, which verifies that the data are indeed highly heavy-tailed, 
especially the pair SCHOOL-PROGRAM. 



Table 1: For each pair of words (ui v.s. it 2 ), we compute the difference vector u = U\ — «2 and ratios of the empirical 
moments for illustrating that the data are highly heavy-tailed. 



Sparsity (ui) Sparsity (tta) Sparsity (it) 



E(|u| 2 ) E(|«H 



THIS (ui) - HAVE (1x2) 0.4226 0.2674 0.4378 9.97 239.81 

SCHOOL (ui) - PROGRAM (u 2 ) 0.0695 0.0816 0.1279 51.77 4076.30 



For each pair, we estimate the l\ distance using very sparse stable random projections with /3 = 0.1, 0.01, and 0.001. 
The results are presented in Figure 2. For the pair THIS-HAVE, even when (3 — 0.001, the results are indistinguishable 
from what we would obtain by exact stable random projections. For the pair SCHOOL-PROGRAM, when /3 = 0.01, 
the results are good. However, when j3 — 0.001, we see considerably larger errors. This is because the data are sparse 
(sparsity = 0.1279, meaning that the "effective data dimension" should be much smaller than D — 65536) and the data 
are highly heavy-tailed. Note that -j= = 0.039, = 0.0118. 
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(a) (b) 
Figure 2: The MSE's (normalized by the l\ distances) of very sparse stable random projections for two pairs of words. 

The "theoretic" curves were the theoretical variances assuming the data are exactly stable (see Lemma 2). 



3.3 Numerical Results on Classifying Microarray Data 



Usually the purpose of computing distances is for the subsequent tasks such as clustering, classification, information 
retrieval, etc. Here we consider the task of classifying deceases in the Harvard microarray dataset[9]'. The original 
dataset contains 203 samples (specimens) in 12600 gene dimensions, including 139 lung adenocarcinomas (12 of 
which may be suspicious), 17 normal samples, 20 pulmonary carcinoids, 21 sqamous cell lung carcinomas, and 6 
SCLC cases. We select the first three classes (in total 139 + 17+20 = 176 samples) as our test dataset. For each 
specimen, we subtract the median (across genes). However, we did not perform any normalization. 

A simple nearest neighbor classifier can classify the samples almost perfectly using the l\ distance. When m = 1, 3, 
5, 7, 9, the m-nearest neighbor classifier mis-classifies 3, 2, 2, 2, 2, samples, respectively. For comparisons, using the 
Q2) correlation distance (i.e., 1 - correlation coefficient), when m = 1, 3, 5, 7, 9, the m-nearest neighbor classifier 
mis-classifies 6, 4, 4, 4, 5, samples, respectively. 

We conduct both stable (i.e., Cauchy) random projections and very sparse stable random projections (f3 — 0.1, 0.01, 
and 0.001) on the dataset and classify the specimens using a 5-nearest neighbor classifier based on the li distances. 
Figure 3 indicates that (A): stable random projections can achieve similar classification accuracy using about 100 
projections (as opposed to the original D = 12600 dimensions); (B): very sparse stable random projections work well 
when (3 = 0.1 and 0.01. Even with (3 = 0.001, the classification results are only slightly worse. 
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Figure 3: We apply stable (Cauchy) random projections and very sparse stable random projections (/3 = 0.1, 0.01, and 
0.001) on the Harvard microarray dataset (n — 176 specimens in D — 12600 dimensions) and classify the specimens 
using a 5-nearest neighbor classifier based on the l\ distances. The horizontal dashed line indicates that, using the 
exact l\ distances, a 5-nearest neighbor classifier mis-classifies 2 samples. The other curves show that using stable 
random projections we can achieve almost the same misclassification errors with just 100 projections (as opposed 
to the original 12600 dimensions). Very sparse stable random projections with j3 = 0.1 and 0.01 perform almost 
indistinguishably from Cauchy random projections. Even when (3 — 0.001, the results are only slightly worse. Each 
curve is averaged over 100 runs. 

3.4 A More General Scheme for Sparse Stable Random Projections 

Instead of using the sparse distribution in (1 1), we could, alternatively, consider the following more general form: 

fP a .f, with prob. § 
with prob. 1-/3 , (17) 
-P a ,n with prob. | 

where P a ,n denotes Pareto{a, /i). That is, Pr (P Q . M > t) = if t > /it; and otherwise. 

Theoretically, (17) is appealing. When we choose a smaller (than 1) /z, e.g., /i = 1/D 7 for some 7 > 0, we can 
actually achieve a faster rate of convergence and a less restrictive condition for ensuring convergence. 

However, as /x decreases, the probability density function (PDF) of P a ,fj, becomes more steep near /1 (as shown in 
Figure 4), i.e., harder to obtain random samples of good quality. 

'http: // re search. dfci . harvard . edu/meyer sonlab/lungca/ files /DatasetA_12 600gene . txt 
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Figure 4: The probability density functions of P Q . M for a = 1 and // = 1, 0.5, and 0.1. 



To conclude this section, we should point out that non-asymptotic analysis on very sparse stable random projections 
is difficult. For the rest of the paper, we will assume that we use the regular stable random projections. 

4 The Geometric Mean Estimators and Tail Bounds 

We present the results for estimating and d" a y where <i( Q ) is the l a distance between u\ and u 2 £ K D , from the 

projected data v\ and v 2 - v\ = R T ui and v 2 — R T it2, where R £ IR' Dxfc have i.i.d. entries r,j ~ S(a, 1). We always 
denote Xj = vi j — v 2 j, j = 1, 2, k, Xj ~ S(a, <ir a )). 

We develop two sets of estimators and tail bounds, one for estimating and another for estimating d^ a y 
4.1 The Geometric Mean Estimator and Tail Bounds for d( a \ 



Our proposed geometric mean estimator df a \ gm for di a \ is based on the following fact about S(a, di 
Proposition 1 Suppose x ~ S (a, dr a -\). Then for — 1 < A < a, 



a)) 



2^ A 



Proof Although not explicitly stated in [60], one can infer this result by combining various statements in [60] (page 
63, page 116, 117. Note the typos in (2.1.9) and (2.1.10)). 

For the sake of verification, we derive E for < A < a, by completing the result stated in [55, Property 1.2.17, 

page 18], which says 



2 A-i r (l-A) 



g(N*)=4» n 0<A<a. (19) 



We can find the explicit expression for the integral[37, 3.823, page 484], as 

By Euler's reflection formula, T(l — z)Y(z) = sin ^^ we know T{— A) = — sin ^^ r(A+i) ' ana ' tne desired result 
follows after some algebra. 



We need the above result for A < a in order to derive the proposed unbiased estimator. We will need to consider A < 
for proving two-sided tail bounds. 

From (18), we can design an unbiased estimator for di a \ as (by taking A = 1/fc) 



nLiN 1/fc i 

d( a )am — r, Xj ~ S((x, dt a \ ) , i.i.d., fc>— . (21) 

C),s [fr Q)r(i-^) sin (|i)r ia)> 



The denominator in d( a y gm can be pre-computed and it converges to a fixed value depending on a. We call this 
estimator the geometric mean estimator although it is really the geometric mean with corrections for both k and a. 

We prove some useful properties of di ol \ gm in the following Lemma (see the proof in Appendix B) 
Lemma 2 The estimator, rf( a ) lSm , as defined in (21 ) 

• It is unbiased, i.e., E {d^ a ^ gm ^ — d( a y 
oo, 



As k 



r i 



i 

koi 



IT 1 

2 k 



exp -7, 



where j e = 0.577215665..., is the Euler's constant. It converges from above monotonically. 
The variance ofd( a ^ gm is, provided k > ^, 



Var 



(3 \_ * f [fr(I)r(i-^)sin( 7 ri)] fc 
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(Q) \ 6fc V 2 



1 



O 



(22) 



(23) 



In Lemma 2, the asymptotic properties of d( a )_ gm are convenient (and more interpretable) when we derive the tail 

bounds for di a \ gm . The monotonicity property will also be used a couple of times. These interesting asymptotic 
properties are proved by careful Taylor expansions and using the infinite-product representations of the Gamma func- 
tion and the sine function. 

In Appendix C, we prove Lemma 3 for the tail bounds of d( a ),gm- 
Lemma 3 The right tail bound 
Pr (d 



1(a), gin 



d(a) > ed( a ) 



) < exp ^~ 



k— , e > 0, k> -, 

M Rtate J a 



where 
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Mr a e 



7T 2 /l 



c a a 



— log(l + C ) - _ log -r 1 - — T - sin -- - — 7e 1 - - 
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a' 



The left tail bound 

Pr (d( a ), gm - < -erf(a)) < exp ^~ 



M. 



L,a,e,feo 



< e < l,fc > ko > -, 
a 



where 
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ML,a,e,k 



1 log(l-e)-llogf--r| I 



e \ . / 7T e 
I sm 

Ca / V 2 C c 



fc log 



7T 



(24) 

(25) 
(26) 

(27) 



(28) 



Note that in (28), fc log([fr (j^j T (l - sin (|^)]) converges to - 7e (l - 1); and hence it is well 
behaved. Restricting k > ko in the left tail shall not raise a concern. Recall our goal is to show that k = O ( lo ^" ; 



suffices, which is usually not too small. We could adjust k to match Mr, u ^ and Mn. a ^^ so mat we can nave a 
convenient "symmetric" bound Pr (\d( a ^ gm — d( Q ) | > erf( Q )^ . We showed this for a = 1 in a technical report [47]. 

We consider the techniques (in Appendix C) for deriving the tail bounds in Lemma 3 are interesting: 



• First we note that di a ^ gm , which only has moments up to k, does not have a moment generating function; and 
hence we can not use the popular Chernoff bound [17]. However, we can always use the Markov moment bound. 
[54] has shown that the moment bound is always sharper than the Chernoff bound for positive random variables, 
even when the Chernoff bound does exist. 

• To get the optimal moment bound in our case is difficult (unless a — 1). We resort to sub-optimal (but asymp- 
totically optimal) bounds by realizing that d/ a ^ gm can be treated as a gamma random variable when k is large 
enough and e is small. For a gamma, we know its optimal moment bound[54, Example 3.3]. The "gamma 
approximation" is due to the central limit theorem for large k. For positively-skewed variables, it is usually a 
good idea to use gamma rather than normal as long as both have the same asymptotic first two moments [49]. 

The right tail bound in Lemma 3 is only "pseudo-exponential," because the constant A/r a e depends on e. However, 
for a given e, no matter how large it is, we can always find the upper bound in an exponential form. From a practical 
point of view, what really matters is that the constants should be as small as possible. Indeed, as illustrated in Figures 
5 and 6, Mr^^ and Mi jQ e fc are reasonably small. 




(a) (b) 
Figure 5: We plot the right tail bound constant M R t a,t in Lemma 3 for a certain range of a and e. 




Figure 6: We plot the left tail constant ML, a ,e,k in Lemma 3 for a certain range of a and e, at fc = 100. 

A direct consequence of the tail bounds in Lemma 3 yields the following JL-like Lemma, which is weaker than the 
classical JL lemma, because the estimator d( a y gm is not a metric. 

Lemma 4 Using the estimator d( a ), ffm > for any fixed e > 0, we only need k = O ( l ° s ^ ^ to guarantee that the l a 

distance between any pair of points among n data points can be estimated within a factor of 1 ± e. Moreover, the 
constant can be explicitly characterized. 

4.2 The Geometric Mean Estimator and Tail Bounds for d? 

(a) 

Sometimes we are more interested in d? a -, than c?( Q ). For example, the classical JL Lemma for I2 is presented in terms 
of the squared I2 distance. [18, 19] approximated the Hamming norm using the l a norm raised to ath power, with 



very small a. Obviously we could raise the estimator (e.g., di a \ gm ) to the ath power, as treated in [18, 19]. 
concerned about the tail bounds, because a (1 ± e) a factor becomes a 1 ± e 2 factor when a = e. 

We can again design an unbiased estimator for d" a ^ , discussed in the next lemma. 
Lemma 5 The following estimator, denoted by E( a ), 
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7, fc>l. 
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is unbiased, i.e., E (jZ( a ), gm ^ = d" a y 
As k — > oo, 
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decreasing monotonically with increasing k. 
The variance ofEi a \ would be, (k > 2), 
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The right tail bound 
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We use the same technique for proving Lemma 5 as for proving Lemma 3; and hence we skip the proof. 

As illustrated in Figure 7, For the reasonable range of e (e.g., < 0.5), our tail bounds produce sensible results even 
when a is extremely small (e.g., 0.001). 

4.3 Fisher Efficiency of the Geometric Mean Estimators 

The geometric mean estimators are numerically straightforward, however, they are not optimal. The maximum like- 
lihood estimators (MLE's) are asymptotically optimal, as the sample size increases to infinity. The Fisher efficiency 
of a proposed estimator is defined to be the asymptotic ratio of the variance of the corresponding MLE to the variance 
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Figure 7: We plot the constants Gn. a ^ and GL, a ,e,k (&0 = 100) f° r the tail bounds in Lemma 5. The curves 
correspond to a = 0.001, 0.01, and from 0.1 to 2 at an increment of 0.1. 



of the proposed estimator. For stable distributions, evaluating the MLE's are in general numerically very intensive 
because the probability density functions have to be numerically calculated, except for a = 2, 1, 0+. We have men- 
tioned previously that there is no need to use the geometric mean estimator when a = 2. The technical report [47] 
discussed in detail the MLE when a = 1. We will discuss the case a — > 0+ in Section 5, which will report that the 
Fisher efficiency of the geometric mean estimator is = 0.6079 asa-> 0+. 

[51] numerically evaluated the Fisher information (reciprocal of the asymptotic variance of the MLE) for a range of 
a values (0.2 < a < 2) and commented on the numerical difficulty for smaller a. Figure 8 shows that the geometric 
mean estimators are > 80% efficient when a = 0.4 ~ 1.1. In other words, in the range of a = 0.4 ~ 1.1, the 
geometric mean estimators are close to be optimal. 




a 



Figure 8: The Fisher efficiency of the geometric mean estimators based on the Fisher information reported in [51] and 
the asymptotic variances of the geometric mean estimators. The single point at (0,0.6079) is based our calculation in 
Section 5. Note that since the Fisher efficiency concerns only the asymptotic ratio of variances, the Fisher efficiency 
would be the same for both the geometric mean estimator for d( Q ) and the geometric mean estimator for d? a y 



5 Comparing Estimators (Especially when a —> 0+) 

The case a — ► 0+ is interesting and practically useful. For example, [18] and [20] applied stable random projections 
for approximating the Hamming norm and the max-dominance norm, respectively, using very small a. 

We will compare our proposed geometric mean estimator with the sample median estimator, which is a special case 
of sample quantile estimators [30, 31, 52], for the case when a — ► 0+. In general, the geometric mean estimator is 
considerably more accurate when the sample size k is not very large. Asymptotically (as k — > oo), the geometric 
mean estimator and sample median estimator is equivalent when a = 1 (see the technical report[47]); but for any 
other specific a ^ 1, we find the geometric mean estimator is always also more accurate asymptotically. At the end 



of this section, we will introduce a better estimator based on the maximum likelihood estimator, which has a simple 
form when a — > 0+. 

As shown by [21] (or see [46, 20]), if x <~ S(a, 1) and a — ► 0+, then converges to 1/E\, where E\ stands for an 
exponential distribution with mean 1. This result is quite obvious by taking limit of (10), the procedure for sampling 
from S(a, 1). 

The median of 1/E\ is l/log(2). Therefore, given Xj, j — 1, 2, k i.i.d. S(a, when a — > 0+, the sample 

median estimator of d" a ^ would be equivalent to 

j. median {zj, j = 1, 2, /c} 

"■me = 7~7"j j (37) 

l/log(2) 

where Zj's are i.i.d. <~ h/E\, h = lim d" a y i.e., the hamming norm. In comparisons, the geometric mean estimator 
for d? s, derived in Section 4.2, becomes 

/v. - w. - A [|r(f)r ^ )sin(ff)r - FfiV 

We know the distribution of Zj ~ h/E^ exactly: 

Pr(zj < t) = exp(-h/t), Pt(zj =t) = exp(-h/i)^, t > 0. (39) 

It is easy to show that for the geometric mean estimator, 

V» (V.) - *" (^f|C - l) = *' i) + O (£) ^ (40, 

The asymptotic variance of the sample median estimator h me can be shown using known statistical results on sample 
quantiles [56, Theorem 5.10]. 

^ ( Hme ) = 4Pr 2 (z J= Vlog(2))(l/log(2))2fc + ° (p ) = io?(2) fe + ° (p ) ' (41) 

Therefore, asymptotically the ratio of the variances ^ Jr (^""=) _ ^ 2 J^^) ~ 1-27- In other words, asymptotically, our 

proposed geometric mean estimator is about 27% more accurate than the sample median estimator when a —> 0+. 

Non-asymptotically, however, the geometric mean estimator can be much more accurate when k is not too large. The 
moments of the sample median estimator T me can be written in an integral form. 

t s ( ( \\\ m ( \\\ m ( l > \l(2m+l)! 



E (*"") = h ' f W(3T ( exp B)) (' - ^ ("?)) cxp ("?) 



t / t 2 (to!) 2 



1 log* (2) ^ „ (2m +1)! 



=/l * / ? i uL ^-n" f n2 df . ( 42 ) 
Jo (-logW) s ( m! ) 2 

based the properties of sample quantiles (see [56, Example 2.9]). For convenience, we only consider k = 2m + 1. 
When s = 1 and 2, this integral can actually be expressed as finite (binomial-type) summations [37, 4.267.41, 4.268.5], 
which, however, are numerically unstable when m > 12. 

We can compare the estimators in terms of their mean square errors (MSE's). The MSE of h me is oo if k < 5 and it is 
about 3.26 times (which is very considerable) as large as that of h gm if k = 5. The ratio of their MSE's, converges to 
about 1.27 as k — > oo, as illustrated in Figure 9. 

The above analysis reveals why it is not convenient to study the sample median estimator, even the probability density 
function is explicitly available. 

It is even more difficult to analyze the tail bounds for the sample median estimator, especially if we want the explicit 
constants. The basic result on tail bounds was from [29], which derived the "pseudo-exponential" tail bounds for the 
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o gm ). The horizontal (dashed) line is the theoretical asymptotic value (i.e., 
1.27). The solid curve is obtained from simulations (5 x 10 5 samples for each k). The thin dashed curve (for k < 25), 
which overlaps with the solid curve, is the theoretical MSE ratios obtained by theoretically integrating (42). 



Figure 9: Ratios of the MSE (h me v.s. h g 



sample quantiles with un-specified constants. Our technique for deriving tail bounds for the geometric mean estimator 
may be also applicable to the sample median estimator. That is, we can use the Markov moment bound together with 
the asymptotic properties of the sample median to derive sub-optimal (but asymptotic optimal) tail bounds, if we do 
not mind bounds in double integral forms. There are "distribution-free" bounds for the moments of order statistics 
[23], but only when the data have bounded first and second moments. 

On the other hand, since we can analyze the geometric mean estimator fairly easily, which is also more accurate, there 
seems to be no need to struggle for the exact moments and bounds for the sample median estimator. 



Finally, we will discuss about the maximum likelihood estimator (MLE), which is asymptotically optimal. For stable 
distributions, MLE is in general very expensive because we will have to numerically evaluate the probability density 
function, except when a = 2, 1, 0+. The technical report [47] discussed about the MLE for a = 1. Here, we will 
discuss the case a = 0+, which is in fact particularly simple for the MLE. 

Recall asa-> 0+, we have k i.i.d. samples zj ~ h/E\. It is easy to show E(— ) = 4, which implies a straightforward 
estimator 

hmie = ^—J- ~- ( 43 ) 

k ^.7 



i m ie is indeed the maximum likelihood estimator. We recommend the bias-corrected version 

t i ( r 
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The moments of h m i e and h m i eyC are analyzed in the following lemma. 
Lemma 6 The first two moments of the maximum likelihood estimator (43) are 



kj \k 2 
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(44) 



h[l + r)+o(^\ (45) 



Var ^) = h \k + w) +0 {-k?)- (46) 

The first two moments of the bias-corrected maximum likelihood estimator (44) are 

E (Krie.c) =h + (47) 

Va i h ^)= h2 ( 1 k + i) +0 (w)- (48) 



Figure 10 verifies the theoretical asymptotic variance formula for h m i e ,c- 
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Figure 10: We plot the empirical MSE of the bias-corrected maximum likelihood estimator h m [ e c (5 x 10 5 samples at 

each k) together with the theoretical asymptotic MSE of h m i e , c up to the third order, i.e., + -p-) (here we consider 
h = 1.) We can see that as soon as k > 7, our theoretical asymptotic formula is very accurate. The thick dashed 
curve is the theoretical variance of the unbiased geometric mean estimator d gm , indicating that h m le,c is always more 
accurate, both asymptotically and non-asymptotically. 



The Fisher efficiency of the geometric mean estimator as a — > 0+ would be 1/ f 2g- J = 0.6079. For this case, the MLE 
solution is actually simpler than the geometric mean estimator. Also, notice that if use the regular normal (I2) random 
projections on the binary-quantized (0/1) data, the estimation variance would be (see [59]), which is about twice 

as large as Var (h m le,cj ■ This implies that stable random projections with very small a not only provide a solution to 

approximating the Hamming norms in dynamic settings (i.e., data are subject to frequent additions/subtractions), but 
also are preferable even in static data. 



6 Conclusion 



Stable random projection is a very useful tool for various applications, such as approximating all pair-wise l a (0 < 
a < 2) distances and data stream computations. 

In this study, we propose very sparse stable random projections, to simply the sampling, to speedup the processing 
time (i.e., matrix multiplication), and to reduce the storage cost. As shown both theoretically and empirically, it is 
evident that we can achieve a very significant improvement without hurting the accuracy when a is less than 1 or a is 
not too larger than 1 . 

We analyze in detail the estimators based on the geometric mean for recovering the original l a norms from the projected 
data. The geometric mean estimators are computationally simple and fairly accurate especially when a is around 1 . 
Moreover, the geometric mean estimators allow us to study the moments precisely and derive practically useful tail 
bounds in explicit exponential forms, which are otherwise difficult to obtain (e.g.,) from the (commonly used) sample 
median estimator or the maximum likelihood estimator. An analog of the JL Lemma for dimension reduction in l a 

follows immediately from these exponential tail bounds: It suffices to use k = O ( lo ^") j projections so that the 

l a distance between any pair of data points (among n data points) can be estimated within a 1 ± e factor with high 
probability, using our proposed geometric mean estimator. 

The geometric mean estimators are particularly useful when solving the maximum likelihood estimators is compu- 
tationally expensive (i.e., when a / 2,1, or 0+). Even when a = 1, the maximum likelihood estimator requires 
solving a high-order polynomial nonlinear equation (see the technical report [47]); and hence it is still considerably 
more expensive than the geometric mean estimator. 



The special case a. — > 0+ is both practically useful and theoretically interesting. In this case, asymptotically, the geo- 
metric mean estimator is 27% more accurate (in terms of variances) than the sample median estimator, and much more 
non-asymptotically. However, in this case, the (biased-corrected) maximum likelihood estimator can be expressed in 
a very simple convenient form and is considerably more accurate than the geometric mean estimator, both asymptoti- 
cally and non-asymptotically. More interestingly, the variance of the maximum likelihood estimator is only about one 
half of the variance of the regular normal (I2) random projections. Therefore, stable random projections with very 
small a not only provide a solution to approximating the Hamming norms in dynamic settings (i.e., data are subject to 
frequent additions/subtractions), but also can be preferable even in static data. 
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A Proof of Lemma 1 

Let Cj = — ]-■ to show the convergence in distribution, it suffices to show the convergence of the Character- 
ful ISal")" 

istic function, i.e., 

log (e ^exp (^ltf3- 1/a J2 CA ) ) ) ~ r ( 1 - a ) cos (| a ) 1*1" (49) 
By our definition of Zi in (11), we have 

E (exp( v /z Tz l i)) =1-0 + TT-^dx 

Ji x 



l+Q 

=l-0 + a0[ / ^-rf dx - / y -r+ dx + / —r^dx 

-^TT^dx — i. Using the integral formula [37, 3.823, page 484], we obtain 

jfss^F^—iw^i -«)«(§«). (si) 

Also, by the Taylor expansion, 

J x l +° dX ~ 22-a 4!4-a " ^ 

Combining the results, we obtain 

E (exp(V=T*t)) =l-0 + 0{l- |t|«T(l - a) cos (|a) + \ ^- g + ...) 

= 1 - 0\t\«T{l - a) cos ga) + ^ J^!_ + ... (53) 
The above steps are similar to [44]. Once we know E (exp(\/^T^f)), we can express 

log (e ^exp (^10- 1/a c ^ ) ) 



= £ (-N a l*l a r(l - «) cos ga) + f '^^'1 - ±|« W (r(l - a) cos ga 



(50) 



If max (\cA) — > 0, then 
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The rate of convergence is determined by the next higher order term, which could be either the \t\ 2 term or the \t\ 2a 
term, depending on a, (3, and the data. The rate of convergence is 
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This completes the proof of Lemma 1 . 

B Proof of Lemma 2 

The estimator defined in (21) 
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is obviously unbiased, i.e., E (d( a ),gm^ = d{a)> because Xj's are i.i.d. S(a, rf( a )) with 
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The variance of d( a ^ gm is then 
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It remains to show the asymptotic behaviors of d( a ) tgm and Var (d( a ),grnj ■ First, we will show 
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where 7 e = 0.577215665..., is the Euler's constant. 
By Euler's reflection formula, T (1 — z) T(z) = . J 
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We use the infinite-product representation of the Gamma function[37, 8.322, page 944], 
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where ci = 5 (l — -^1), = — — ^r)> c 3 = 1 (l — jjt)> ■••> are the coefficients from the Taylor expansion of 
the logarithms. Note that I? = IP an ^ J2^Li 5*TT < S^Li 77 < 00 for t > 2. We can then write 
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It involves tedious algorithm to check the monotonicity, i.e., \-T (£) V (l — J-) sin (f r)l decreases with increas- 
ing fc. Our approach is to also use the infinite-product representation of the sine function [37, 1.431.1, page 44], 
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so that we could express the whole [|r(^)r(l — j^j) sin (-|^)] as an infinite- 



product. Then it suffices to show that, for any s > 1, 
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is monotonically decreasing. This is a much easier problem. We can take its first two derivatives of (62) and check 
that the first derivative is monotonically increasing with increasing k, reaching when k = 00. Therefore, the first 
derivative of (62) is negative; and hence we have proved the monotonicity. 



Finally, we need to show the asymptotic variance di a \ gm 
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Again, we use Taylor expansions. First, by Euler's reflection formula and some algebra, we obtain 
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For 2/ in r ) k 2 \ 7 we again resort to the infinite-product representation of the Gamma function. Some algebra yields 
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and this completes of the proof of Lemma 2. 

C Proof of Lemma 3 

We first show the right tail bound 
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First, we note that di 



n? = ii^i 1/fc 



only has moments less than fc. Therefore, we have to use the 



{a) - 9m [|r(i)r(i- I fc) s in( f i) J 
Markov moment bound instead of the more often used Chernoff bound [17]. As a matter of fact, for positive random 
variables, e.g., d( a ) ;9m , the moment bound is always sharper than the Chernoff bound [54, 50]. 

Applying the moment bound, 
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Usually, we will then have to find the optimal t that minimizes the upper bound. This is difficult in our case, because 
the optimization will involve the Gamma functions. We avoid this problem by a trick. 

We know how to choose the optimal t for a gamma random variable[54, Example 3.3]. di ol \ gm in fact resembles 

a gamma. As k increases, d[ a ). gm approaches a normal asymptotically, but it can be better characterized by an 

asymptotically equivalent gamma (i.e., with the same mean and variance), because d( a y gm is positive and has non- 
zero odd central moments. 



Suppose z is gamma with mean /j, and variance a 2 , then Pr (z > (1 + e)/i) < J0^r~t^ which is minimized 



at 



t = 1 + l^frj when t is restricted to be integers. This can be inferred directly from [54, Example 3.3]. Since the 

2 

moment bound is true for any t > 0, we could for convenience choose t — e^. 



Now, if we assume that d^ gm is a gamma random variable with mean d^ and variance d 2 a ^ (i 
convenience, we consider the asymptotic variance), then we can infer the "optimal" t value to be 
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(68) 



which is, of course, indeed only sub-optimal because d( a ^ gm is not a gamma non-asymptotically. However, because 
the moment bound (67) holds for any t > 0, we can plug in t — -j-k and obtain 
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In the above derivation, we have used the result from Lemma 2, which says, if k > —, then 
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Next, we will show the left tail bound 
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For any < t < k, 

Pr (d( a ), gm < (1 - e)^(a)) 
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Again, we need to find the optimal t. To avoid the difficulty, we "borrow" the sub-optimal t = -^-k, from the right tail 
bound. The rationale is that, if d( a )_ gm were symmetric about d( a y then the optimal t values will be the same for both 
tails. In our case, di a \ gm is positive skewed with the skewness decreases with increasing k. Using t = —k will be 
conservative but should not deviate too much from the true optimum when k is not too small. 

Plugging in t — ^-k,we obtain, after some algebra, 
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We have proved in Lemma 2 that k log ([-T(^)r(l — sm (§](:)]) — > — 7 e (l — ^) monotonically from above 
if A: > -k Therefore, if fc > /c , then < 



; and hence we have completed the proof. 



D Proof of Lemma 6 



Assume z <~ h/E\, where E\ stands for the exponential distribution with mean 1 . The log likelihood, l(z;h), and first 
three derivatives (w.r.t. h) are 

l{z;h) = -- z +\og{h)-2\og{z) (75) 

l'{h) = \~- (76) 
h z 

l"(h) = ~ (77) 

l'"{d) = A. (78) 

The MLE /i m / e is asymptotically normal with mean /i and variance ^^y, where 1(h), the expected Fisher Information, 
is 



l = l(h)=E(-l"(h)) = -. 
General formulas for the bias and higher moments of the MLE are available in [8, 57]: 
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where, after re-formatting, 

[12] = E(l'f +E(l'l"), [1 4 ]=E(0 4 , [1 2 2] -E(/"(?') 2 )+E(?') 4 , 

[13] = E(Z') 4 + 3E(?"(/') 2 ) + E(i'r), [l 3 ] - E(l'f. (82) 

Note that, for any integer m > 0, 

E0^ =jj Q exp(-^j^dz = J s m exp(-s)ds = m s™" 1 exp(-s)rfs = ml, (83) 
from which it follows that 

E(0 3 = ~ E(i'l") = 0, E(0 4 = ^, E(i"(0 2 ) = -^, E(/r')=0. (84) 

Hence 

[12] = -|, Hl4, [1*2]=*, [13] = A, [l 3 ] = - Jj. (85) 

Thus, we obtain 

E ( h mle)=h+l + 0^ (86) 

f- \ h 2 4h 2 / 1 \ 

Var M = T + ^ + U) (87) 

Because /i m ; e has O (|) bias, we recommend the bias-corrected estimator 

h m le,c — h m i e ^1 — , (88) 

whose first two moments are 

E (hmle,c) =h + (j^j (89) 

This completes the proof of Lemma 6. 
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